Business Understanding

Housing prices have long been a serious social issue in Taiwan, especially in Taipei City. The nationwide house-price-to-income ratio published by the Taiwanese government was around 8.62 in 2020 ([1]); for Taipei City it was around 14.39. If we compare these values with the cities with the highest price-to-income ratios reported in the Demographia International Housing Affordability Survey 2019 ([2]), it is not hard to see how high these numbers are. This phenomenon makes it nearly impossible for the younger generation to own a home in the capital city of Taiwan.

In this project, I would like to perform exploratory data analysis and then build a regression model to understand the housing price situation in Taipei City, Taiwan. In order to gain deeper insight into the dataset, we have three major problems to solve with data mining.

1. The distribution and trend of housing prices in Taipei. We will also analyze this problem district by district.

2. By building a linear regression model to predict housing prices, we wish to find the features that most positively influence housing prices.

3. In addition to the most positively influential features, we are also curious about the features that most negatively influence housing prices.

We hope to find a possible way to address the problem of rising housing prices, or at least provide some hints and guidance for those who wish to invest in real estate in Taipei City.


Data Understanding

In this project, we downloaded the raw dataset from the Ministry of the Interior ([4]) of Taiwan and performed some transformations in a previous data preparation notebook: we kept only the features of interest and converted traditional Chinese text into English. An alternative version of this dataset is also available on Kaggle ([3]). The dataset contains useful information about recent (second half of 2020) real estate transactions in Taipei City. We will discuss the dataset further in the next few cells.

Import Libraries

Before importing our dataset, we have to import the necessary Python modules.

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import csv 
import math
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
pd.options.display.max_columns = 50
%matplotlib inline

Import Dataset

Now let's import our dataset and check the first few rows with the head method. Our original dataset has **9813** transaction records and **25** features.

In [2]:
#Import the dataset as "Resident_Dataset"; non-residential transactions were already filtered out in data preparation.
Resident_Dataset = pd.read_csv('2020_H2_Resident_ENG.csv')
Shape = Resident_Dataset.shape

#Let's see how the dataset looks like
Resident_Dataset.head()
Out[2]:
district transaction type land shifting total area transaction year num_building num_land num_garage total floor number building state main building materials complete year building shifting total area num_room num_bathroom compartment management org carpark type carpark shifting area carpark total price main building area subsidiary building area balcony area elevator total ntd unit ntd
0 Wenshan District Land+Building 1261.30 108 1 3 0 1.0 House Reinforced Concrete NaN 245.48 0 0 0 0 No carpark 0.0 0 245.48 0.00 0.00 0 523060000 2130764
1 ZhongZheng District Land+Building 36.00 109 1 2 0 5.0 Apartment_5storey Reinforced Concrete 76.0 142.01 3 2 1 0 No carpark 0.0 0 106.57 4.81 16.47 0 18500000 130273
2 ZhongZheng District Land+Building 19.39 109 1 2 0 7.0 Apartment_5to10storey Reinforced Concrete 102.0 84.97 2 1 1 0 No carpark 0.0 0 50.17 0.00 5.05 1 21580000 253972
3 ZhongZheng District Land+Building 10.66 109 1 1 0 5.0 Apartment_5storey Strengthened Brickwork NaN 34.63 0 1 1 0 No carpark 0.0 0 33.00 1.63 0.00 0 10400000 300318
4 ZhongZheng District Land+Building 45.00 109 1 1 0 4.0 Apartment_5storey Strengthened Brickwork 58.0 99.86 3 2 1 0 No carpark 0.0 0 99.86 0.00 0.00 0 35800000 358502

This dataset can roughly be divided into three parts:

  • Physical information about the transaction, e.g. number of rooms, building state, total floor number, building area
  • Non-physical information about the transaction, e.g. district, transaction type, complete year, transaction year
  • Economic information about the transaction, e.g. total ntd, unit ntd

We will treat the economic information as our target variable and use the other two kinds of information as features for modeling the target. In the next part, we will dive into the features and handle both numerical and categorical variables.
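The three-part split above can be written down as plain Python lists (the column names come from the dataset; the grouping itself is our reading of the data, not something encoded in the file):

```python
# Grouping of example columns, following the three-part split described above.
physical_features = [
    "num_room", "building state", "total floor number",
    "main building area", "balcony area", "num_bathroom",
]
non_physical_features = [
    "district", "transaction type", "complete year", "transaction year",
]
target_columns = ["total ntd", "unit ntd"]

# Features used for modeling are everything except the economic targets.
model_features = physical_features + non_physical_features
print(len(model_features))  # 10
```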

Prepare Data

Data Transformation

In this cell, we transform some of the columns into quantities per unit of land or building.

Also, we do some feature engineering to create some new features for later analysis.

In [3]:
#We do not take transactions with 0 rooms into consideration
Resident_Dataset = Resident_Dataset[Resident_Dataset["num_room"]!=0]

# This part averages land shifting total area over the number of land lots in the transaction.
Resident_Dataset["land shifting total area per unit transaction"] = Resident_Dataset["land shifting total area"]/Resident_Dataset["num_land"]
Resident_Dataset["land shifting total area per unit transaction"].replace(np.inf, np.nan, inplace = True)
Resident_Dataset["land shifting total area per unit transaction"].fillna(0,inplace = True)

# This part averages building shifting total area over the number of buildings in the transaction.
Resident_Dataset["building shifting total area per unit transaction"] = Resident_Dataset["building shifting total area"]/Resident_Dataset["num_building"]

#This part averages some other columns over the number of buildings
Resident_Dataset["num_room"] = Resident_Dataset["num_room"]/Resident_Dataset["num_building"]
Resident_Dataset["num_bathroom"] = Resident_Dataset["num_bathroom"]/Resident_Dataset["num_building"]
Resident_Dataset["main building area"] = Resident_Dataset["main building area"]/Resident_Dataset["num_building"]
Resident_Dataset["balcony area"] = Resident_Dataset["balcony area"]/Resident_Dataset["num_building"]
Resident_Dataset["subsidiary building area"] = Resident_Dataset["subsidiary building area"]/Resident_Dataset["num_building"]

#This part adds some derived features to the dataset
#This line creates total shifting area per unit transaction
Resident_Dataset["total shifting area per unit transaction"] = Resident_Dataset["land shifting total area per unit transaction"] + Resident_Dataset["building shifting total area per unit transaction"]

#This line creates building age at transaction
Resident_Dataset["building age at transaction"] = Resident_Dataset['transaction year'] - Resident_Dataset['complete year']

#This line creates main building ratio by dividing main building area by building shifting total area
Resident_Dataset["main building ratio"] = Resident_Dataset["main building area"]/Resident_Dataset["building shifting total area"]

#This line creates unit price with the carpark price excluded
Resident_Dataset["unit ntd nocarpark"] = (Resident_Dataset["total ntd"] - Resident_Dataset["carpark total price"])/Resident_Dataset["building shifting total area"]

#Finally, we check the first few rows of our dataset
Resident_Dataset.head()
Out[3]:
district transaction type land shifting total area transaction year num_building num_land num_garage total floor number building state main building materials complete year building shifting total area num_room num_bathroom compartment management org carpark type carpark shifting area carpark total price main building area subsidiary building area balcony area elevator total ntd unit ntd land shifting total area per unit transaction building shifting total area per unit transaction total shifting area per unit transaction building age at transaction main building ratio unit ntd nocarpark
1 ZhongZheng District Land+Building 36.00 109 1 2 0 5.0 Apartment_5storey Reinforced Concrete 76.0 142.01 3.0 2.0 1 0 No carpark 0.00 0 106.57 4.81 16.47 0 18500000 130273 18.000 142.01 160.010 33.0 0.750440 130272.516020
2 ZhongZheng District Land+Building 19.39 109 1 2 0 7.0 Apartment_5to10storey Reinforced Concrete 102.0 84.97 2.0 1.0 1 0 No carpark 0.00 0 50.17 0.00 5.05 1 21580000 253972 9.695 84.97 94.665 7.0 0.590444 253971.990114
4 ZhongZheng District Land+Building 45.00 109 1 1 0 4.0 Apartment_5storey Strengthened Brickwork 58.0 99.86 3.0 2.0 1 0 No carpark 0.00 0 99.86 0.00 0.00 0 35800000 358502 45.000 99.86 144.860 51.0 1.000000 358501.902664
6 Wanhua District Land+Building+Garage 4.46 109 1 1 1 24.0 Apartment_11storeyorgreater Reinforced Concrete 108.0 97.18 1.0 1.0 1 1 Ramp Plane 37.02 2800000 33.95 3.15 3.22 1 14750000 198637 4.460 97.18 101.640 1.0 0.349352 122967.688825
7 Wanhua District Land+Building+Garage 4.57 109 1 1 1 24.0 Apartment_11storeyorgreater Reinforced Concrete 108.0 75.59 2.0 1.0 1 1 Ramp Machinery 12.96 900000 35.45 2.61 3.75 1 13000000 193198 4.570 75.59 80.160 1.0 0.468977 160074.083874
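Note that transaction year and complete year above (e.g. 108, 109) are Minguo (ROC) calendar years, which is why the second-half-of-2020 data shows year 109. A hypothetical helper for converting to Gregorian years, not part of the original pipeline:

```python
def roc_to_gregorian(roc_year: int) -> int:
    """Convert a Minguo (ROC) calendar year to a Gregorian year."""
    return roc_year + 1911

print(roc_to_gregorian(109))  # 2020
```

Building age stays the same either way, since the offset cancels in the subtraction.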

Checking Duplicates

In this cell, we check for duplicated rows and find that there are 43 of them.

In [4]:
#We use the duplicated method to find duplicated rows and print the result
num_duplicated = sum(Resident_Dataset.duplicated(keep = "first"))
print("There are " + str(num_duplicated) + " duplicated rows in this dataset")

#Then, we update our dataset to remove the duplicates
Resident_Dataset = Resident_Dataset[~Resident_Dataset.duplicated(keep = "first")]
display(Resident_Dataset.head())

#Check again to see if there are duplicates
num_duplicated = sum(Resident_Dataset.duplicated(keep = "first"))
print("There are " + str(num_duplicated) + " duplicated rows in this dataset after cleaning")
There are 43 duplicated rows in this dataset
district transaction type land shifting total area transaction year num_building num_land num_garage total floor number building state main building materials complete year building shifting total area num_room num_bathroom compartment management org carpark type carpark shifting area carpark total price main building area subsidiary building area balcony area elevator total ntd unit ntd land shifting total area per unit transaction building shifting total area per unit transaction total shifting area per unit transaction building age at transaction main building ratio unit ntd nocarpark
1 ZhongZheng District Land+Building 36.00 109 1 2 0 5.0 Apartment_5storey Reinforced Concrete 76.0 142.01 3.0 2.0 1 0 No carpark 0.00 0 106.57 4.81 16.47 0 18500000 130273 18.000 142.01 160.010 33.0 0.750440 130272.516020
2 ZhongZheng District Land+Building 19.39 109 1 2 0 7.0 Apartment_5to10storey Reinforced Concrete 102.0 84.97 2.0 1.0 1 0 No carpark 0.00 0 50.17 0.00 5.05 1 21580000 253972 9.695 84.97 94.665 7.0 0.590444 253971.990114
4 ZhongZheng District Land+Building 45.00 109 1 1 0 4.0 Apartment_5storey Strengthened Brickwork 58.0 99.86 3.0 2.0 1 0 No carpark 0.00 0 99.86 0.00 0.00 0 35800000 358502 45.000 99.86 144.860 51.0 1.000000 358501.902664
6 Wanhua District Land+Building+Garage 4.46 109 1 1 1 24.0 Apartment_11storeyorgreater Reinforced Concrete 108.0 97.18 1.0 1.0 1 1 Ramp Plane 37.02 2800000 33.95 3.15 3.22 1 14750000 198637 4.460 97.18 101.640 1.0 0.349352 122967.688825
7 Wanhua District Land+Building+Garage 4.57 109 1 1 1 24.0 Apartment_11storeyorgreater Reinforced Concrete 108.0 75.59 2.0 1.0 1 1 Ramp Machinery 12.96 900000 35.45 2.61 3.75 1 13000000 193198 4.570 75.59 80.160 1.0 0.468977 160074.083874
There are 0 duplicated rows in this dataset after cleaning
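The boolean-mask deduplication above is equivalent to pandas' built-in `drop_duplicates`; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# Approach used in the notebook: mask out rows flagged as duplicates
mask_way = df[~df.duplicated(keep="first")]

# Equivalent built-in
builtin_way = df.drop_duplicates(keep="first")

print(mask_way.equals(builtin_way))  # True
```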

Checking and Removing NAN

The result shows that only 4 columns contain NaN or Null values, with all proportions below 6%.

In [5]:
#We build a dataframe to exhibit the percentage of NaN or Null value 
NAN_Rate = pd.DataFrame(data={"% of NAN":Resident_Dataset.isna().sum()/Shape[0]})
NAN_Rate[NAN_Rate["% of NAN"]!=0]
Out[5]:
% of NAN
total floor number 0.001427
main building materials 0.000204
complete year 0.058800
building age at transaction 0.058800

Subsequently, we remove the rows with NaN or Null values, since they make up only a small fraction (under 6%) of the dataset.

In [6]:
#Drop rows containing NaN values, column by column
for cols in list(Resident_Dataset.columns):
    Resident_Dataset = Resident_Dataset[~Resident_Dataset[cols].isna()]

#Print out current shape of Resident Dataset
print("Shape of Resident_Dataset is now " + str(Resident_Dataset.shape))

#Clone resident dataset for exploratory data analysis
Raw_Resident_Dataset = Resident_Dataset.copy()

#Drop columns that we will not use in modeling process
Resident_Dataset.drop(columns=["carpark total price","num_garage","carpark shifting area","carpark type","land shifting total area","building shifting total area","num_land","num_building","transaction year","complete year","total ntd","unit ntd"], inplace = True)
Resident_Dataset.head()
Shape of Resident_Dataset is now (8485, 31)
Out[6]:
district transaction type total floor number building state main building materials num_room num_bathroom compartment management org main building area subsidiary building area balcony area elevator land shifting total area per unit transaction building shifting total area per unit transaction total shifting area per unit transaction building age at transaction main building ratio unit ntd nocarpark
1 ZhongZheng District Land+Building 5.0 Apartment_5storey Reinforced Concrete 3.0 2.0 1 0 106.57 4.81 16.47 0 18.000 142.01 160.010 33.0 0.750440 130272.516020
2 ZhongZheng District Land+Building 7.0 Apartment_5to10storey Reinforced Concrete 2.0 1.0 1 0 50.17 0.00 5.05 1 9.695 84.97 94.665 7.0 0.590444 253971.990114
4 ZhongZheng District Land+Building 4.0 Apartment_5storey Strengthened Brickwork 3.0 2.0 1 0 99.86 0.00 0.00 0 45.000 99.86 144.860 51.0 1.000000 358501.902664
6 Wanhua District Land+Building+Garage 24.0 Apartment_11storeyorgreater Reinforced Concrete 1.0 1.0 1 1 33.95 3.15 3.22 1 4.460 97.18 101.640 1.0 0.349352 122967.688825
7 Wanhua District Land+Building+Garage 24.0 Apartment_11storeyorgreater Reinforced Concrete 2.0 1.0 1 1 35.45 2.61 3.75 1 4.570 75.59 80.160 1.0 0.468977 160074.083874
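The column-by-column NaN filter above is equivalent to a single `dropna` call; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Column-by-column masking, as in the notebook
loop_way = df.copy()
for col in loop_way.columns:
    loop_way = loop_way[~loop_way[col].isna()]

# Equivalent one-liner: drop any row containing a NaN
builtin_way = df.dropna()

print(loop_way.equals(builtin_way))  # True
```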

This cell filters out transactions with an abnormally high number of rooms or bathrooms (more than 7 per unit building transaction in our case).

In [7]:
#It is unusual for a single building unit to have more than 7 rooms or bathrooms,
#so we filter out data outside this range.
Resident_Dataset = Resident_Dataset[Resident_Dataset["num_room"] <= 7]
Resident_Dataset = Resident_Dataset[Resident_Dataset["num_bathroom"] <= 7]

Distribution of Numerical Variables

First, we extract the numerical columns from the dataset based on column data type. In this step, we also drop some numerical columns that are really discrete or non-quantitative variables. That leaves 11 numerical features in our dataset.

In [8]:
#We create a table listing data types of all columns 
#and pick up numerical ones based on their data type
Type_Table = pd.DataFrame(data={"dtype":Resident_Dataset.dtypes})
Numerical_Columns = list(Type_Table[Type_Table["dtype"] != "object"].index)
"num_land","num_building"
#Remove column management_org in numerical cloumns because it's dummy variable type numerical
#Remove transaction year,month and complete year because they are discrete and non-quatitative variables
try:
    Numerical_Columns.remove("management org")
    Numerical_Columns.remove("elevator")
    Numerical_Columns.remove("unit ntd nocarpark")
    Numerical_Columns.remove("compartment")
except:
    pass

#Report results of numerical columns
print("We have " + str(len(Numerical_Columns))+" numerical features.")
print("Numerical Columns are :\n" + str(Numerical_Columns))
We have 11 numerical features.
Numerical Columns are :
['total floor number', 'num_room', 'num_bathroom', 'main building area', 'subsidiary building area', 'balcony area', 'land shifting total area per unit transaction', 'building shifting total area per unit transaction', 'total shifting area per unit transaction', 'building age at transaction', 'main building ratio']

Second, we visualize the distributions of the numerical variables with histograms. You can see that many of the distributions are right-skewed.

In [9]:
fig = plt.figure(figsize = [30,30])
sub_ind = 1
for cols in Numerical_Columns:
    Resident_Dataset[cols].hist(ax = plt.subplot(6,2,sub_ind), bins =25)
    plt.title(cols)
    sub_ind += 1

Statistics of Numerical Variables and Outlier (Extreme Value) Rejection

This cell creates a statistical table including the 1.5×IQR bounds. We will use this table to eliminate data points that fall far outside normal conditions.

In [10]:
#Create list for storing statistical value
Num_Max = []
Num_min = []
mean_lst = []
stdlist = []
stdup = []
stdlow = []
IQRup = []
IQRlow = []

#Iterate through numerical columns
for cols in Numerical_Columns:
    #Acquire basic statistical values
    mean = Resident_Dataset[cols].mean()
    std = Resident_Dataset[cols].std()
    Q1 = Resident_Dataset[cols].quantile(0.25)
    Q3 = Resident_Dataset[cols].quantile(0.75)
    mean_lst.append(mean)
    stdlist.append(std)
    #Calculate IQR limits
    IQR = Q3 - Q1
    Num_Max.append(Resident_Dataset[cols].max())
    Num_min.append(Resident_Dataset[cols].min())
    IQRup.append(Q3 + 1.5*IQR)
    IQRlow.append(Q1 - 1.5*IQR)

#Cast lists to the dataframe
Numerical_Columns_Stats = pd.DataFrame({"Feature":Numerical_Columns,"Max":Num_Max,"min":Num_min,"Mean":mean_lst,"std":stdlist ,"IQRup":IQRup,"IQRlow":IQRlow})
Numerical_Columns_Stats.index = Numerical_Columns_Stats["Feature"]
Numerical_Columns_Stats.drop(columns = ["Feature"], inplace = True)
Numerical_Columns_Stats
Out[10]:
Max min Mean std IQRup IQRlow
Feature
total floor number 38.000000 1.000000 9.567440 5.388982 25.000000 -7.000000
num_room 7.000000 0.333333 2.586193 1.083867 4.500000 0.500000
num_bathroom 7.000000 0.000000 1.648150 0.758346 3.500000 -0.500000
main building area 542.670000 0.000000 78.150816 44.000036 179.931250 -34.278750
subsidiary building area 592.210000 0.000000 2.944846 9.788576 8.181250 -4.908750
balcony area 71.870000 0.000000 8.062152 6.487620 24.180000 -9.100000
land shifting total area per unit transaction 1171.000000 0.000000 23.689662 27.887291 62.289375 -22.295625
building shifting total area per unit transaction 818.030000 0.010000 124.319895 79.172328 270.778750 -43.051250
total shifting area per unit transaction 1654.010000 0.010000 148.009557 94.867060 314.053214 -43.715357
building age at transaction 69.000000 -9.000000 25.055608 16.242810 82.500000 -33.500000
main building ratio 1.748464 0.000000 0.657473 0.206409 1.339088 -0.016730
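The per-column loop above can also be written as a few vectorized DataFrame operations; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [10, 20, 30, 40, 50]})

# Quantiles computed column-wise in one call each
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Same columns as the notebook's statistics table
stats = pd.DataFrame({
    "Max": df.max(), "min": df.min(),
    "Mean": df.mean(), "std": df.std(),
    "IQRup": q3 + 1.5 * iqr, "IQRlow": q1 - 1.5 * iqr,
})
print(stats.loc["x", "IQRup"])  # 7.0
```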

We scan through the numerical columns and cut off data points with values greater than the upper limit defined by the 1.5×IQR rule.

In [11]:
#Define the features requiring cleaning
clean_columns = ["main building area", "subsidiary building area", "balcony area",\
                "land shifting total area per unit transaction", "building shifting total area per unit transaction",\
                 "total shifting area per unit transaction"]

#Iteratively clean the features
for cols in clean_columns:
    threshold = Numerical_Columns_Stats.loc[cols,"IQRup"]
    Resident_Dataset = Resident_Dataset[Resident_Dataset[cols] < threshold]

#Show the present shape of dataset 
Resident_Dataset.shape
Out[11]:
(7096, 19)

In this cell, we further clean the dataset according to our prediction target (unit ntd nocarpark). We remove the most extreme values of this column because they may not reflect normal housing transaction behavior compared to the values inside our range of interest.

Concretely, we trim the lowest and highest 1% of records after sorting by unit ntd nocarpark. After filtering, you can see the nearly symmetric distribution of unit ntd nocarpark.

In [12]:
#In order to get a more general model that fits non-extreme cases,
#we abandon the first and last 1% of data after sorting by unit ntd.
old_unit_ntd = Resident_Dataset["unit ntd nocarpark"]
percent_1 = Resident_Dataset.shape[0]*0.01
Resident_Dataset = Resident_Dataset.sort_values(by=['unit ntd nocarpark']).iloc[int(percent_1):-int(percent_1),:]

#Plot to show distribution before and after cleaning dependent variable
plt.figure(figsize = [15,8])
plt.hist(old_unit_ntd, bins = 100,alpha=0.5)
plt.hist(Resident_Dataset["unit ntd nocarpark"], bins = 100, alpha = 0.8)
plt.xlabel("Unit ntd nocarpark")
plt.ylabel("Frequency")
plt.legend(["Original Data","Cleaned Data 2%"])
plt.title("Distribution of Unit ntd after cutting 2% data")
Out[12]:
Text(0.5, 1.0, 'Distribution of Unit ntd after cutting 2% data')
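The sort-and-slice trim above can equivalently be done with quantile thresholds, which avoids reordering the frame; a sketch on a toy Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 201, dtype=float))  # 200 evenly spaced values

# Keep only values between the 1st and 99th percentiles
low, high = s.quantile(0.01), s.quantile(0.99)
trimmed = s[(s >= low) & (s <= high)]
print(len(trimmed))
```

The two approaches can differ slightly at the boundaries (quantile interpolation vs. integer row counts), but both drop roughly 1% from each tail.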

Exploratory Data Analysis

Q1. The distribution and trend of housing price in Taipei

In this part, we analyze the raw data of each district and try to observe trends in housing transactions in the second half of 2020. To start, we create a new dataframe to store aggregated or averaged values of our variables.

In [13]:
#Create dataframe for district analysis
District_Analysis = pd.DataFrame(columns = ["district","num_building", "num_land", "land shifting total area per unit transaction", "building shifting total area per unit transaction", "main building ratio", "building age at transaction", "unit ntd nocarpark"],dtype='object')

#Iterate through districts to get the average or sum of each feature
for dist in list(Raw_Resident_Dataset['district'].unique()):
    Temp_Series_Mean = Raw_Resident_Dataset[Raw_Resident_Dataset['district'] == dist].mean()
    Temp_Series_Sum = Raw_Resident_Dataset[Raw_Resident_Dataset['district'] == dist].sum()
    Temp_Row = {'district':dist}
    for feat in list(District_Analysis.columns):
        #We want summation value of num_building and num_land
        if feat in ["num_building","num_land"]:
            Temp_Row[feat] = Temp_Series_Sum[feat]
        #We want mean value of other features
        elif feat != 'district':
            Temp_Row[feat] = Temp_Series_Mean[feat]   
    #Append the row onto dataframe
    District_Analysis = District_Analysis.append(Temp_Row, ignore_index = True)
#Reset the index with district name
District_Analysis.index = Raw_Resident_Dataset['district'].unique()
District_Analysis
Out[13]:
district num_building num_land land shifting total area per unit transaction building shifting total area per unit transaction main building ratio building age at transaction unit ntd nocarpark
ZhongZheng District ZhongZheng District 330 480 15.150040 104.887500 0.644689 25.770701 219917.489612
Wanhua District Wanhua District 688 919 14.128712 102.969222 0.641229 20.145185 138936.890138
Neihu District Neihu District 1262 1599 28.936386 133.999541 0.632080 21.687852 165947.778320
Zhongshan District Zhongshan District 1118 1411 17.458679 108.963350 0.637029 25.073488 192964.329615
Songshan District Songshan District 500 716 19.954958 107.153867 0.743555 34.328571 210458.771003
Xinyi District Xinyi District 548 762 23.695229 119.891152 0.729130 31.419476 200383.447806
Nangang District Nangang District 323 399 28.593336 156.645095 0.584567 18.613924 171549.929035
Beitou District Beitou District 1034 1396 29.188145 134.387517 0.673854 26.093347 135898.786125
Shilin District Shilin District 879 1138 35.728385 138.465276 0.728335 30.285207 169880.057228
Daan District Daan District 825 1242 17.184504 129.757491 0.652929 25.399751 259168.739127
Wenshan District Wenshan District 828 1270 29.697434 128.379380 0.609948 24.018587 134566.651829
Datong District Datong District 380 618 12.908624 117.246769 0.611126 16.781915 171726.888779
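The row-append loop above (note that `DataFrame.append` is deprecated in recent pandas) mirrors a groupby aggregation; a minimal sketch on toy data using two of the columns from this table:

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["A", "A", "B"],
    "num_building": [1, 2, 1],
    "unit ntd nocarpark": [100.0, 200.0, 300.0],
})

# Sum counts, average prices, per district
district_analysis = df.groupby("district").agg(
    {"num_building": "sum", "unit ntd nocarpark": "mean"}
)
print(district_analysis.loc["A", "num_building"])  # 3
```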

This cell downloads the geojson file we need for the choropleth maps. We also translate the Chinese TOWNNAME values into English.

In [ ]:
#Import modules for acquiring geojson
from urllib.request import urlopen
import json
#Import geojson from Github
with urlopen('https://raw.githubusercontent.com/g0v/twgeojson/master/json/twTown1982.geo.json') as response:
    geo_taipei = json.load(response)  #Collect geojson data from Github

#Dictionary used for translation of district name
district_dict = {"文山區":"Wenshan District",\
                                  "中正區":"ZhongZheng District",\
                                  "內湖區":"Neihu District",\
                                  "萬華區":"Wanhua District",\
                                  "中山區":"Zhongshan District",\
                                  "南港區":"Nangang District",\
                                  "大同區":"Datong District",\
                                  "松山區":"Songshan District",\
                                  "大安區":"Daan District",\
                                  "信義區":"Xinyi District",\
                                  "北投區":"Beitou District",\
                                  "士林區":"Shilin District"}

#Replace Chinese TOWNNAME to English one
for i in range(len(geo_taipei['features'])):
    if geo_taipei['features'][i]['properties']['COUNTYNAME'] == "台北市":     
        geo_taipei['features'][i]['properties']['TOWNNAME'] = district_dict[geo_taipei['features'][i]['properties']['TOWNNAME']]

In this cell, we plot the number of building transactions in each district on a choropleth map. From the map, you can see that some districts away from the center of Taipei (e.g. Neihu, Beitou) have relatively more transactions than those in the center. This phenomenon suggests that the housing market may be close to saturation downtown, and that more and more people are seeking houses in the suburban areas.

In [ ]:
#import folium
import folium

#Create a base map for the choropleth
map0 = folium.Map(location=[25.06,121.52], zoom_start=11)

#Create choropleth map object with key on TOWNNAME
folium.Choropleth(geo_data = geo_taipei,\
    name="choropleth",\
    data = District_Analysis,\
    columns = ["district","num_building"],\
    key_on='feature.properties.TOWNNAME',\
    fill_color='YlOrRd',\
    fill_opacity=0.7,\
    line_opacity=0.5,\
    legend_name='Taipei',).add_to(map0)

#Create style_function
style_function = lambda x: {'fillColor': '#ffffff', 
                            'color':'#000000', 
                            'fillOpacity': 0.1, 
                            'weight': 0.1}

#Create highlight_function
highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

#Create popup tooltip object
NIL = folium.features.GeoJson(
    geo_taipei,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=['TOWNNAME'],
        aliases=['District'],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")))

map0.add_child(NIL)
map0.keep_in_front(NIL)
folium.LayerControl().add_to(map0)

map0

In this cell, we plot the mean housing price per square meter for each district. We found that even though the number of transactions in downtown Taipei is not growing as fast as in the suburban areas, the average housing price there is still much higher. This may be another reason why people nowadays favor the suburbs over downtown.

In [ ]:
#Create a base map for the choropleth
map1 = folium.Map(location=[25.06,121.52], zoom_start=11)

#Create choropleth map object with key on TOWNNAME
folium.Choropleth(geo_data = geo_taipei,\
    name="choropleth",\
    data = District_Analysis,\
    columns = ["district","unit ntd nocarpark"],\
    key_on='feature.properties.TOWNNAME',\
    fill_color='YlOrRd',\
    fill_opacity=0.7,\
    line_opacity=0.5,\
    legend_name='Taipei',).add_to(map1)

#Create style_function
style_function = lambda x: {'fillColor': '#ffffff', 
                            'color':'#000000', 
                            'fillOpacity': 0.1, 
                            'weight': 0.1}
#Create highlight_function
highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}
#Create popup tooltip object
NIL = folium.features.GeoJson(
    geo_taipei,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=['TOWNNAME'],
        aliases=['District'],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")))

map1.add_child(NIL)
map1.keep_in_front(NIL)
folium.LayerControl().add_to(map1)

map1

This part shows the number of transactions of each building state in each district.

The first thing to notice is that a very high proportion of transactions are apartments rather than houses. This is expected, since the population density in Taiwan, and especially Taipei, is so high that construction companies tend to make the most of the land.

Secondly, the proportion of high-storey building transactions is lower in suburban districts (e.g. Beitou, Shilin, Neihu, Wenshan). Perhaps this is related to their lower average housing prices. We should check whether this holds after establishing our predictive model.

In [15]:
#Create building type table to store building type stats for each district
Building_Type_Table = Raw_Resident_Dataset.groupby(by=["district","building state"], as_index=False).count()

#Create table listing ratio of high storey building
age_buildingstate_table = District_Analysis[['district','building age at transaction']]
high_storey_ratio = []
for dist in list(age_buildingstate_table.index):
    Temp_Table = Building_Type_Table[Building_Type_Table["district"] == dist]
    mean = Temp_Table[Temp_Table['building state'] == 'Apartment_11storeyorgreater'].mean(axis=1)
    sums = Temp_Table["transaction type"].sum()
    ratio = mean/sums
    high_storey_ratio.append(float(ratio))
    
age_buildingstate_table.insert(2,"high storey building ratio",high_storey_ratio)    

#Import seaborn library
import seaborn as sbn

#Plot the result using barplot
plt.figure(figsize = [20,10])
plt.xticks(rotation=45)
 
Building_State_Plot = sbn.barplot(data = Building_Type_Table,x = "district",y = 'transaction type',hue = 'building state')
age_buildingstate_table['high storey building ratio'].plot(secondary_y = True, marker = 'o', style = '--',color = 'red')
Building_State_Plot.set(ylim = (0, 700))
Building_State_Plot.legend(loc='upper center')
plt.ylim([-0.35, 0.75])
plt.ylabel("high storey building ratio")
plt.legend()
Out[15]:
<matplotlib.legend.Legend at 0x1bd9f3c3b48>

Feature Selection

Correlation Analysis

Starting here, we begin the modeling process. We first select features for our numerical dataset, importing the necessary libraries for the later regression.

In [16]:
#Import regression modules
from sklearn.linear_model import LinearRegression 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

This cell creates a table of the numerical variables.

In [17]:
#Create numerical table
Numerical_Table = Resident_Dataset[Numerical_Columns]
Numerical_Table.head()
Out[17]:
total floor number num_room num_bathroom main building area subsidiary building area balcony area land shifting total area per unit transaction building shifting total area per unit transaction total shifting area per unit transaction building age at transaction main building ratio
2086 9.0 3.0 1.0 46.95 0.00 4.25 29.27 77.57 106.84 22.0 0.605260
6416 4.0 3.0 1.0 101.17 0.00 0.00 34.25 101.17 135.42 44.0 1.000000
3927 4.0 3.0 2.0 85.80 0.00 15.34 37.01 101.14 138.15 40.0 0.848329
7931 14.0 4.0 2.0 38.85 1.04 4.28 13.84 79.80 93.64 9.0 0.486842
1064 4.0 3.0 1.0 42.76 0.00 0.00 0.00 42.76 42.76 50.0 1.000000

This cell computes the correlation of each numerical feature with the target "unit ntd nocarpark"; features with an absolute correlation coefficient above 0.15 are kept for modeling.

In [18]:
#Add unit ntd nocarpark into dataset for correlation analysis
Numerical_Table2 = pd.concat([Numerical_Table,Resident_Dataset["unit ntd nocarpark"]], axis=1)
Numerical_Corr = pd.DataFrame(abs(Numerical_Table2.corr()["unit ntd nocarpark"]))
Numerical_Corr.columns = ["Correlation Coefficient"]
Numerical_Corr.sort_values(by = "Correlation Coefficient", ascending = False, inplace = True)

#Display correlation results of unit ntd nocarpark
display(Numerical_Corr)

#Plot the correlation results of unit ntd nocarpark
plt.figure(figsize = [20,8])
plt.xticks(rotation = 30)
plt.xlabel("Numerical Feature")
plt.ylabel("Correlation Coefficient")
plt.title("Correlation Coefficient of Features to Unit NTD")
plt.plot(Numerical_Corr, marker = 'o') 

Numerical_Columns2 = list(Numerical_Corr[Numerical_Corr["Correlation Coefficient"]>0.15].index)
Numerical_Columns2.remove("unit ntd nocarpark")

del Numerical_Table2
Correlation Coefficient
unit ntd nocarpark 1.000000
building age at transaction 0.278973
total floor number 0.268191
num_room 0.215274
land shifting total area per unit transaction 0.208782
main building ratio 0.194202
main building area 0.128230
subsidiary building area 0.098069
balcony area 0.097366
total shifting area per unit transaction 0.085438
num_bathroom 0.084828
building shifting total area per unit transaction 0.044638

Unique Categories within Categorical Features

Before converting categorical features, we first briefly inspect the number of different categories in each categorical variable.

In [19]:
#Check number of unique values (categories) in each feature
Categorical_Unique = {"Categorical_Variable":[],"Unique_Values":[]}
Categorical_Columns = []
for cols in list(Resident_Dataset.columns):
    if cols not in Numerical_Columns and cols != "unit ntd nocarpark":
        Categorical_Columns.append(cols)
        Categorical_Unique["Categorical_Variable"].append(cols)
        Categorical_Unique["Unique_Values"].append(len(Resident_Dataset[cols].unique()))
Categorical_Unique_Table = pd.DataFrame(Categorical_Unique)
Categorical_Unique_Table
Out[19]:
Categorical_Variable Unique_Values
0 district 12
1 transaction type 3
2 building state 8
3 main building materials 9
4 compartment 1
5 management org 2
6 elevator 2

Creating Dummy Variables for Categorical Features

Before turning these variables into dummy variables, we briefly look at the proportion of each category within every variable.

In [20]:
data_num = Resident_Dataset.shape[0]
fig2,ax2 = plt.subplots(figsize = [30,60])   
ind = 1
#Create proportion dict for storing category percentage content
Proportion_Dict = {}

#Iterate through categorical columns
for var in Categorical_Columns:
    #Create temporary dataframe and do calculations
    Temp_DF = pd.DataFrame(Resident_Dataset[var].value_counts()).reset_index()
    Temp_DF.insert(0,"variable",[var]*Temp_DF.shape[0])
    Temp_DF.columns = ["variable","category","percentage"]
    Temp_DF["percentage"] = Temp_DF["percentage"]/data_num*100
    Temp_DF.set_index(["variable","category"],inplace=True)
    for cats in list(Resident_Dataset[var].unique()):
        if var not in Proportion_Dict.keys():
            Proportion_Dict[var] = {}
        Proportion_Dict[var][cats] = Temp_DF.loc[var,cats]
        
    #Plot the results also iteratively    
    Temp_DF.loc[var].plot(kind="bar", ax = plt.subplot(15,1,ind),title=var,fontsize=15)
    plt.xticks(rotation=15)
    plt.xlabel("Category",fontsize = 18)
    plt.ylabel("Percentage",fontsize = 18)
    plt.title(var,fontsize=22)
    ind+=2

Meaningless Feature Rejection

In this cell, we drop dummy columns for categories that appear in too few samples (less than 1% of the data). Such sparse indicator columns carry little information and can destabilize the regression.
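The dropping rule can be sketched on a toy series; the category names here are illustrative, but the 1% frequency threshold matches the one used in the cell below.

```python
import pandas as pd

# Toy categorical series: "garage" covers only 0.5% of the 200 rows
s = pd.Series(["building"] * 150 + ["land"] * 49 + ["garage"])

dummies = pd.get_dummies(s, prefix="type")
percentage = s.value_counts() / len(s) * 100

# Drop dummy columns whose category covers less than 1% of the data
rare = [f"type_{cat}" for cat, pct in percentage.items() if pct < 1]
dummies = dummies.drop(columns=rare)
print(list(dummies.columns))  # → ['type_building', 'type_land']
```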

In [21]:
#Drop dummy columns for rare categories (<1% of the data), which carry little
#information and may harm the regression

#Create a new dataset to store data for later regression
Categorical_Dataset = pd.DataFrame()

#For feature transaction type, we omit the categories "building" and "garage" because
#they occupy only a very minor part of the data; their information is largely captured
#by the remaining categories.
exclusion = ["management org","elevator","compartment"]
for var in Categorical_Columns:
    if var not in exclusion:
        Categorical_Dataset = pd.concat([Categorical_Dataset,pd.get_dummies(Resident_Dataset[var],prefix = var)], axis = 1, sort = False)
        for cats in Proportion_Dict[var].keys():
            value = float(Proportion_Dict[var][cats])
            if value < 1:
                Categorical_Dataset.drop(columns = [str(var) + "_" + str(cats)], inplace = True)
    else:
        Categorical_Dataset[var] = Resident_Dataset[var]
        
#Drop some other columns
Categorical_Dataset.drop(columns=["compartment","main building materials_Other"],inplace=True)   
Categorical_Dataset
Out[21]:
district_Beitou District district_Daan District district_Datong District district_Nangang District district_Neihu District district_Shilin District district_Songshan District district_Wanhua District district_Wenshan District district_Xinyi District district_ZhongZheng District district_Zhongshan District transaction type_Land+Building transaction type_Land+Building+Garage building state_Apartment_11storeyorgreater building state_Apartment_5storey building state_Apartment_5to10storey building state_Suite main building materials_Reinforced Concrete main building materials_Steel Reinforced Concrete main building materials_Strengthened Brickwork management org elevator
2086 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1
6416 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0
3927 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0
7931 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1
1064 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1110 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1
3107 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1
8997 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1
3971 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1
2155 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1

6956 rows × 23 columns

Create Final Dataset

Finally, we concatenate the categorical and numerical datasets into our final dataset for regression.

In [22]:
Dataset = pd.concat([Categorical_Dataset,Resident_Dataset[Numerical_Columns2]], axis = 1)
Dataset.head(5)
Out[22]:
district_Beitou District district_Daan District district_Datong District district_Nangang District district_Neihu District district_Shilin District district_Songshan District district_Wanhua District district_Wenshan District district_Xinyi District district_ZhongZheng District district_Zhongshan District transaction type_Land+Building transaction type_Land+Building+Garage building state_Apartment_11storeyorgreater building state_Apartment_5storey building state_Apartment_5to10storey building state_Suite main building materials_Reinforced Concrete main building materials_Steel Reinforced Concrete main building materials_Strengthened Brickwork management org elevator building age at transaction total floor number num_room land shifting total area per unit transaction main building ratio
2086 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 22.0 9.0 3.0 29.27 0.605260
6416 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 44.0 4.0 3.0 34.25 1.000000
3927 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 40.0 4.0 3.0 37.01 0.848329
7931 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 9.0 14.0 4.0 13.84 0.486842
1064 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 50.0 4.0 3.0 0.00 1.000000

Data Modeling

OLS Fitting

We first divide our dataset into training and test sets using the train_test_split method from scikit-learn. Then we fit an Ordinary Least Squares (OLS) model with the statsmodels.api module.

In [23]:
#import statsmodel.api module
import statsmodels.api as sm
from  sklearn.metrics import r2_score

#Adding intercept term for dataset
Dataset["intercept"] = 1

#Divide whole dataset into x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(Dataset,Resident_Dataset["unit ntd nocarpark"],test_size = 0.3,random_state = 0)

#Fit the data with ordinary least square method
LR_OLS = sm.OLS(y_train, x_train)
LR_OLS_result = LR_OLS.fit()
display(LR_OLS_result.summary())
print("The Root Mean Square Error of this regression from test set is " + str(np.sqrt(mean_squared_error(LR_OLS_result.predict(x_test),y_test))))

#Build dataframe of model summary for later discussion
normalize_factor = []
for cols in list(Dataset.columns):
    Maxval = Dataset[cols].max()
    minval = Dataset[cols].min()
    if Maxval-minval == 1:
        normalize_factor.append(1)
    else:
        normalize_factor.append(Maxval-minval)

LR_OLS_Table = pd.DataFrame(LR_OLS_result.params, columns=['coef'])
LR_OLS_Table['coef_normalized'] = LR_OLS_Table['coef']*normalize_factor
LR_OLS_Table.insert(1,'P value',list(round(LR_OLS_result.pvalues,3)))

#Remove columns with relatively high p values to see if we can get a better result (threshold 0.05)
Dataset = Dataset[LR_OLS_Table[LR_OLS_Table["P value"]<0.05].index]
OLS Regression Results
Dep. Variable: unit ntd nocarpark R-squared: 0.552
Model: OLS Adj. R-squared: 0.549
Method: Least Squares F-statistic: 220.9
Date: Wed, 24 Feb 2021 Prob (F-statistic): 0.00
Time: 20:53:48 Log-Likelihood: -58121.
No. Observations: 4869 AIC: 1.163e+05
Df Residuals: 4841 BIC: 1.165e+05
Df Model: 27
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
district_Beitou District -2.549e+04 1964.246 -12.979 0.000 -2.93e+04 -2.16e+04
district_Daan District 7.674e+04 2121.886 36.166 0.000 7.26e+04 8.09e+04
district_Datong District -129.7349 2620.269 -0.050 0.961 -5266.652 5007.183
district_Nangang District 4848.4578 3044.803 1.592 0.111 -1120.739 1.08e+04
district_Neihu District -1580.2620 1820.255 -0.868 0.385 -5148.788 1988.264
district_Shilin District 3057.2386 2120.052 1.442 0.149 -1099.027 7213.504
district_Songshan District 5.316e+04 2355.954 22.563 0.000 4.85e+04 5.78e+04
district_Wanhua District -3.2e+04 2089.178 -15.317 0.000 -3.61e+04 -2.79e+04
district_Wenshan District -2.501e+04 2069.762 -12.083 0.000 -2.91e+04 -2.1e+04
district_Xinyi District 3.534e+04 2290.178 15.429 0.000 3.08e+04 3.98e+04
district_ZhongZheng District 4.635e+04 2787.429 16.628 0.000 4.09e+04 5.18e+04
district_Zhongshan District 2.392e+04 1833.483 13.047 0.000 2.03e+04 2.75e+04
transaction type_Land+Building 7.785e+04 1.13e+04 6.916 0.000 5.58e+04 9.99e+04
transaction type_Land+Building+Garage 5.517e+04 1.13e+04 4.864 0.000 3.29e+04 7.74e+04
building state_Apartment_11storeyorgreater -2.492e+04 1.06e+04 -2.357 0.018 -4.57e+04 -4190.537
building state_Apartment_5storey -4.179e+04 7954.372 -5.254 0.000 -5.74e+04 -2.62e+04
building state_Apartment_5to10storey -2.433e+04 1.05e+04 -2.316 0.021 -4.49e+04 -3735.618
building state_Suite -2.913e+04 1.03e+04 -2.825 0.005 -4.93e+04 -8916.402
main building materials_Reinforced Concrete -1.599e+04 3830.247 -4.174 0.000 -2.35e+04 -8477.521
main building materials_Steel Reinforced Concrete 8939.2838 5140.828 1.739 0.082 -1139.074 1.9e+04
main building materials_Strengthened Brickwork 1.251e+04 5261.586 2.377 0.017 2191.636 2.28e+04
management org 3635.5140 1862.889 1.952 0.051 -16.595 7287.622
elevator -2185.3030 7563.829 -0.289 0.773 -1.7e+04 1.26e+04
building age at transaction -1591.6396 64.953 -24.504 0.000 -1718.977 -1464.302
total floor number 206.5823 242.697 0.851 0.395 -269.214 682.378
num_room -3337.4339 632.116 -5.280 0.000 -4576.669 -2098.199
land shifting total area per unit transaction 32.1199 54.170 0.593 0.553 -74.077 138.317
main building ratio 3.684e+04 4943.955 7.452 0.000 2.71e+04 4.65e+04
intercept 1.592e+05 1.35e+04 11.763 0.000 1.33e+05 1.86e+05
Omnibus: 193.000 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 579.652
Skew: -0.086 Prob(JB): 1.35e-126
Kurtosis: 4.682 Cond. No. 1.59e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.64e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
The Root Mean Square Error of this regression from test set is 36892.83661775932

This cell shows the coefficient table sorted by the normalized coefficient.

In [24]:
display(LR_OLS_Table.sort_values(by = 'coef_normalized', ascending = False))
coef P value coef_normalized
transaction type_Land+Building 77853.001726 0.000 77853.001726
district_Daan District 76739.479677 0.000 76739.479677
transaction type_Land+Building+Garage 55165.380970 0.000 55165.380970
district_Songshan District 53156.754383 0.000 53156.754383
district_ZhongZheng District 46350.237088 0.000 46350.237088
main building ratio 36841.986176 0.000 39139.388943
district_Xinyi District 35335.542233 0.000 35335.542233
district_Zhongshan District 23920.975908 0.000 23920.975908
main building materials_Strengthened Brickwork 12506.733590 0.017 12506.733590
main building materials_Steel Reinforced Concrete 8939.283840 0.082 8939.283840
total floor number 206.582323 0.395 7436.963630
district_Nangang District 4848.457808 0.111 4848.457808
management org 3635.513999 0.051 3635.513999
district_Shilin District 3057.238644 0.149 3057.238644
land shifting total area per unit transaction 32.119934 0.553 1991.435896
intercept 159196.430080 0.000 0.000000
district_Datong District -129.734927 0.961 -129.734927
district_Neihu District -1580.261970 0.385 -1580.261970
elevator -2185.302951 0.773 -2185.302951
main building materials_Reinforced Concrete -15986.543941 0.000 -15986.543941
num_room -3337.433907 0.000 -22249.559379
building state_Apartment_5to10storey -24333.126456 0.021 -24333.126456
building state_Apartment_11storeyorgreater -24923.318085 0.018 -24923.318085
district_Wenshan District -25009.800039 0.000 -25009.800039
district_Beitou District -25493.519999 0.000 -25493.519999
building state_Suite -29132.970558 0.005 -29132.970558
district_Wanhua District -31998.938728 0.000 -31998.938728
building state_Apartment_5storey -41789.414528 0.000 -41789.414528
building age at transaction -1591.639584 0.000 -113006.410441

Variance_Inflation_Factor Analysis (Reduce MultiCollinearity)

The OLS summary warns that there may be a strong multicollinearity problem in our model. Thus, we perform a variance_inflation_factor analysis and iteratively drop features until every VIF value is below 10 (a typical threshold indicating no serious multicollinearity issue).
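As a minimal illustration of what VIF detects (on synthetic data, not our housing dataset), one feature built as a near-copy of another gets a VIF far above 10, while an independent feature stays near 1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)            # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500) # nearly a copy of x1

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# x1 and x3 get VIFs well above 10; the independent x2 stays near 1
print([round(v, 1) for v in vifs])
```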

In [25]:
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Create new coefficient after multiplying span of the feature
Coef = LR_OLS_Table[LR_OLS_Table["P value"]<0.05]['coef_normalized']


#Remove intercept in our analysis
column = list(Dataset.columns)
column.remove("intercept")   
Temp_Dataset = Dataset[column]
abandoned_VIF = []

#Iterate until all VIFs are less than 10(typical value)
VIF_safe = False

#Run the while loop until all VIF is below 10
while VIF_safe == False:
    #Create dataframe for real time analysis 
    vif_table = pd.DataFrame()
    vif_table["Features"] = column
    vif_table["VIF"] = [variance_inflation_factor(Temp_Dataset.values, i) for i in range(Temp_Dataset.shape[1])]

    #Handle inf values
    if sum(vif_table["VIF"] == np.inf) > 0:
        column.remove(list(vif_table[vif_table["VIF"] == np.inf]["Features"])[0])
        Temp_Dataset = Temp_Dataset[column]
        
    #Handle VIF greater than 10
    elif sum(vif_table["VIF"] > 10) > 0:
        over10features = vif_table[vif_table["VIF"]>10]["Features"]
        Params = pd.DataFrame(abs(Coef[over10features]))
        Params.sort_values(by = "coef_normalized", inplace = True)
        column.remove(Params.iloc[0,:].name)
        Temp_Dataset = Temp_Dataset[column]
        abandoned_VIF.append(Params.iloc[0,:].name)
        
    #All VIFs are less than 10, set VIF_safe = True
    else: 
        VIF_safe = True
    
#Delete Temp_Dataset
del Temp_Dataset

#Assign numerical columns to the result after VIF analysis
Dataset = Dataset[column]

display(vif_table)
print("Abandoned features are " + str(abandoned_VIF))
Features VIF
0 district_Beitou District 1.287830
1 district_Daan District 1.300634
2 district_Songshan District 1.241928
3 district_Wanhua District 1.250511
4 district_Wenshan District 1.244690
5 district_Xinyi District 1.197679
6 district_ZhongZheng District 1.101470
7 district_Zhongshan District 1.397846
8 transaction type_Land+Building+Garage 2.143154
9 building state_Apartment_11storeyorgreater 2.182101
10 building state_Apartment_5storey 2.894925
11 building state_Suite 1.064043
12 main building materials_Strengthened Brickwork 1.127091
13 building age at transaction 7.342064
14 num_room 6.888733
Abandoned features are ['main building materials_Reinforced Concrete', 'building state_Apartment_5to10storey', 'main building ratio', 'transaction type_Land+Building']

This cell plots the final VIF of each feature that was kept.

In [26]:
#Create plots for plotting VIF results
fig0, ax0 = plt.subplots(figsize = [15,7])
ax0.set_ylim([0,9])
ax0.scatter(vif_table["Features"], vif_table["VIF"], s = 60)
ax0.set_xticklabels(list(vif_table["Features"]), rotation = 75)
ax0.set_xlabel("Features")
ax0.set_ylabel("VIF(Variance_Inflation_Factor)")
ax0.set_title("VIF Versus Features after selection")
#Add VIF values onto the feature data point
for i in range(vif_table.shape[0]):
    fea = vif_table.iloc[i,0]
    val = round(vif_table.iloc[i,1],3)
    ax0.annotate(val,(fea,val+0.2))

Iterative Regression Experiment

We run five regressions with different random seeds to check that the results are consistent. The dataframe Coef_Table stores the coefficients from each trial, and an extra column holds their average, which we take as the final result in the later discussion. The line plot at the end of this cell shows that the coefficients barely change across seeds, i.e. the model is highly consistent and reproducible.

In [27]:
#We create a metrics table to present the R square and root mean square metrics of our model
Metrics_Table = pd.DataFrame(index = ["LR_OLS_R1","LR_OLS_R2","LR_OLS_R3","LR_OLS_R4","LR_OLS_R5"],\
                             columns = ["R squared from training set", "RMSE from test set"])

Dataset["intercept"] = 1
Coef_Table = pd.DataFrame(index = list(Dataset.columns))

for i in range(1,6):
    x_train1, x_test1, y_train1, y_test1 = train_test_split(Dataset, Resident_Dataset["unit ntd nocarpark"], test_size=0.3, random_state = i)
    LR_OLS2 = sm.OLS(y_train1, x_train1)
    LR_OLS2_result = LR_OLS2.fit()
    Metrics_Table.loc["LR_OLS_R" + str(i), "R squared from training set"] = LR_OLS2_result.rsquared
    Metrics_Table.loc["LR_OLS_R" + str(i), "RMSE from test set"] = np.sqrt(mean_squared_error(LR_OLS2_result.predict(x_test1),y_test1))
    Coef_Table["LR_OLS_R" + str(i) + "_Coef"] = list(LR_OLS2_result.params)

#Calculate average coefficient value of five regressions    
Coef_Table["LR_OLS_AVG"] = Coef_Table.mean(axis = 1)

#We visualize the coefficient results with line plot
plt.figure(figsize=[20,10])
plt.xticks(rotation = 45)
plt.xlabel("Features")
plt.ylabel("Coefficient")
plt.title("Comparison of coefficients of different regression")
plt.plot(Coef_Table)
plt.legend(Coef_Table.columns, loc = "center")

display(Coef_Table)
LR_OLS_R1_Coef LR_OLS_R2_Coef LR_OLS_R3_Coef LR_OLS_R4_Coef LR_OLS_R5_Coef LR_OLS_AVG
district_Beitou District -25795.525751 -27394.291222 -28082.408821 -27858.881850 -27365.455833 -27299.312695
district_Daan District 79351.148768 77493.212104 79844.340495 79302.238436 80274.459047 79253.079770
district_Songshan District 52703.637951 51894.476031 49562.814280 49284.087978 55003.934225 51689.790093
district_Wanhua District -33210.755015 -33752.986548 -33962.532248 -33397.310915 -33091.855624 -33483.088070
district_Wenshan District -28284.781033 -28997.402285 -28383.749199 -27578.863060 -27536.752625 -28156.309640
district_Xinyi District 35950.720608 35098.087986 36717.217902 34918.302659 33666.494913 35270.164814
district_ZhongZheng District 46217.626995 44084.132229 44948.910133 44898.576056 43352.292096 44700.307502
district_Zhongshan District 21742.565769 20519.024247 20707.526095 21831.175928 21497.898199 21259.638048
transaction type_Land+Building+Garage -25001.115340 -24495.844612 -25044.153061 -24626.657716 -24973.086169 -24828.171379
building state_Apartment_11storeyorgreater 3749.322960 2799.971519 2177.237617 3289.000293 2091.592848 2821.425047
building state_Apartment_5storey -15484.321911 -15776.409887 -15306.237947 -14415.500960 -16604.307817 -15517.355705
building state_Suite -1798.754511 -501.332200 -1602.027798 -159.706340 -5269.540664 -1866.272303
main building materials_Strengthened Brickwork 34808.230914 34415.557876 30638.779427 32107.585737 31119.009526 32617.832696
building age at transaction -1373.157571 -1355.405131 -1366.066794 -1358.889141 -1353.463116 -1361.396351
num_room -2367.273852 -2203.160553 -2349.525455 -2559.057840 -2503.245609 -2396.452662
intercept 216499.693298 216654.799770 217613.788828 217014.764340 217990.042994 217154.617846

Evaluate the Results

Metrics evaluation

This cell shows the metrics table that stores the R squared and RMSE of each regression run.

In [28]:
#Metrics table list the results of each iteration of regression
Metrics_Table
Out[28]:
R squared from training set RMSE from test set
LR_OLS_R1 0.52338 36463.092829
LR_OLS_R2 0.527212 38071.798631
LR_OLS_R3 0.533254 37998.675821
LR_OLS_R4 0.536129 38795.314283
LR_OLS_R5 0.537483 37734.910847

Final Gain(Coefficient) Evaluation

This cell creates a table listing the gain (coefficient) of each remaining feature and the normalized gain (the coefficient multiplied by the feature span). We will use this table in the later discussion.

In [29]:
#Create a new dataframe
Final_Gain_Table = pd.DataFrame(columns = ["Coefficient","Coefficient_Normalized","P value"], index = list(Dataset.columns))
normalize_factor2 = []

#Get normalize factor to scale the coefficient
for fea in list(Dataset.columns):
    maxval = Dataset[fea].max()
    minval = Dataset[fea].min()
    normalize_factor2.append(maxval-minval)

#Write values into the table   
Final_Gain_Table["Coefficient"] = LR_OLS2_result.params
Final_Gain_Table["Coefficient_Normalized"] = Final_Gain_Table["Coefficient"]*normalize_factor2
Final_Gain_Table["P value"] = round(LR_OLS2_result.pvalues,5)
Final_Gain_Table
Out[29]:
Coefficient Coefficient_Normalized P value
district_Beitou District -27365.455833 -27365.455833 0.00000
district_Daan District 80274.459047 80274.459047 0.00000
district_Songshan District 55003.934225 55003.934225 0.00000
district_Wanhua District -33091.855624 -33091.855624 0.00000
district_Wenshan District -27536.752625 -27536.752625 0.00000
district_Xinyi District 33666.494913 33666.494913 0.00000
district_ZhongZheng District 43352.292096 43352.292096 0.00000
district_Zhongshan District 21497.898199 21497.898199 0.00000
transaction type_Land+Building+Garage -24973.086169 -24973.086169 0.00000
building state_Apartment_11storeyorgreater 2091.592848 2091.592848 0.13880
building state_Apartment_5storey -16604.307817 -16604.307817 0.00000
building state_Suite -5269.540664 -5269.540664 0.12476
main building materials_Strengthened Brickwork 31119.009526 31119.009526 0.00000
building age at transaction -1353.463116 -96095.881238 0.00000
num_room -2503.245609 -16688.304060 0.00002
intercept 217990.042994 0.000000 0.00000

Q2. What are the most positively influential features?

Location

We first analyze the effect of location. The cell below extracts the location-related results from the coefficient table.

In [30]:
#Load coefficient of location related features from coefficient table
Location_Series = pd.DataFrame(columns = ["District","Coefficient","P value"])
for index in list(LR_OLS_Table.index):
    if "district" in index:
        Location_Series = Location_Series.append({"District":index,"Coefficient":LR_OLS_Table.loc[index,"coef_normalized"], "P value":LR_OLS_Table.loc[index,"P value"]}, ignore_index=True)

Location_Series.sort_values(by = "Coefficient", ascending = False)
Out[30]:
District Coefficient P value
1 district_Daan District 76739.479677 0.000
6 district_Songshan District 53156.754383 0.000
10 district_ZhongZheng District 46350.237088 0.000
9 district_Xinyi District 35335.542233 0.000
11 district_Zhongshan District 23920.975908 0.000
3 district_Nangang District 4848.457808 0.111
5 district_Shilin District 3057.238644 0.149
2 district_Datong District -129.734927 0.961
4 district_Neihu District -1580.261970 0.385
8 district_Wenshan District -25009.800039 0.000
0 district_Beitou District -25493.519999 0.000
7 district_Wanhua District -31998.938728 0.000

You may compare our results with a screenshot from Taiwanese media discussing housing prices in Taipei city in 2020 (we added English translations of the district names). Our results closely follow that trend: the district with the highest housing price is Daan, whereas the district with the lowest is Wanhua. It seems our model successfully captures the influence of location, and the results also fit well with our exploratory observations in the previous section.

PS: The unit in the screenshot is price per ping, while our analysis uses price per square meter; one ping is approximately 3.3 square meters.
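A minimal conversion between the two units, assuming the approximate factor stated above (the example price is made up for illustration):

```python
# Approximate conversion factor: 1 ping ≈ 3.3 square meters
PING_IN_SQM = 3.3

def per_ping_to_per_sqm(price_per_ping):
    """Convert a price quoted per ping to a price per square meter."""
    return price_per_ping / PING_IN_SQM

# A hypothetical listing at 990,000 NTD per ping is about 300,000 NTD per square meter
print(round(per_ping_to_per_sqm(990_000)))  # → 300000
```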

image.png

Main building ratio

Main building ratio is the proportion of the main building area over the total building shifting area. Our model indicates that houses with a higher main building ratio (that is, relatively less subsidiary and balcony area) tend to have a higher average price.
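This definition can be sanity-checked against the first row of the numerical table shown earlier (main building area 46.95 m², building shifting total area 77.57 m², listed ratio 0.605260):

```python
# Values from row 2086 of the numerical table shown earlier
main_building_area = 46.95            # square meters
building_shifting_total_area = 77.57  # square meters

main_building_ratio = main_building_area / building_shifting_total_area
print(round(main_building_ratio, 6))  # → 0.60526
```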

In [31]:
LR_OLS_Table.loc[['main building ratio']]
Out[31]:
coef P value coef_normalized
main building ratio 36841.986176 0.0 39139.388943

Main building material

The coefficients suggest that strengthened brickwork has a relatively large positive impact. However, this is not conclusive: most records in our dataset are reinforced concrete, while the other categories occupy only a very minor part. Such an imbalanced distribution makes it hard for the algorithm to estimate the influence of each category, so we are conservative about claiming that the main building material is highly influential.

In [32]:
#Load coefficient of location related features from coefficient table
Material_Series = pd.DataFrame(columns = ["Main Building Material","Coefficient"])
for index in list(LR_OLS_Table.index):
    if "main building materials" in index:
        Material_Series = Material_Series.append({"Main Building Material":index,"Coefficient":LR_OLS_Table.loc[index,"coef_normalized"],"P value":LR_OLS_Table.loc[index,"P value"]}, ignore_index=True)

Material_Series.sort_values(by = "Coefficient", ascending = False)
Out[32]:
Main Building Material Coefficient P value
2 main building materials_Strengthened Brickwork 12506.733590 0.017
1 main building materials_Steel Reinforced Concrete 8939.283840 0.082
0 main building materials_Reinforced Concrete -15986.543941 0.000

Q3. What are the most negatively influential features?

Building age

Our result indicates that the older the house, the lower the price, which follows common sense. If we multiply the raw coefficient by the span of the feature "building age at transaction", this feature can shift the predicted price by around 113,000 NTD per square meter in our model. This makes building age both a strong factor affecting housing price and the most negatively influential feature.

In [33]:
LR_OLS_Table.loc[["building age at transaction"]]
Out[33]:
coef P value coef_normalized
building age at transaction -1591.639584 0.0 -113006.410441
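The normalized coefficient above can be reproduced by multiplying the raw coefficient by the feature span. A minimal sketch, assuming a span of 71 years for "building age at transaction" (this span is inferred from the table, since -1591.64 × 71 ≈ -113,006; it is not stated explicitly in the dataset summary):

```python
# Raw OLS coefficient for "building age at transaction" (NTD per m² per year)
raw_coef = -1591.639584
# Assumed feature span (max minus min building age), inferred from the table above
age_span = 71.0

# Total influence of the feature over its full range
influence = raw_coef * age_span
print(round(influence))  # about -113,006 NTD per square meter
```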

Building state

If we further analyze building state, we can see that apartments in the 5storey category show the most negative gain. Buildings in this category are relatively low and typically have no elevator, so the negative coefficient is unsurprising. The 11storeyorgreater category shows a less negative gain, indicating that buyers still favor taller buildings with elevators over suites and low-storey buildings. This partly explains the exploratory observation in the previous section that building type affects housing price.

In [34]:
#Collect coefficients of the building-state dummy features from the coefficient table
#(same list-of-dicts pattern as above, since DataFrame.append was removed in pandas 2.0)
state_rows = [
    {"Building_State": index,
     "Coefficient": LR_OLS_Table.loc[index, "coef_normalized"],
     "P value": LR_OLS_Table.loc[index, "P value"]}
    for index in LR_OLS_Table.index
    if "building state" in index
]
Building_State = pd.DataFrame(state_rows)

Building_State.sort_values(by="Coefficient", ascending=False)
Out[34]:
Building_State Coefficient P value
2 building state_Apartment_5to10storey -24333.126456 0.021
0 building state_Apartment_11storeyorgreater -24923.318085 0.018
3 building state_Suite -29132.970558 0.005
1 building state_Apartment_5storey -41789.414528 0.000

Location

As discussed in the positively influential features above, location also plays an important role in negatively affecting housing price. For transactions taking place in suburban districts such as Wenshan and Beitou, prices tend to be lower compared to the downtown districts.

Conclusion

After processing the raw data, we first started with exploratory analysis and made some observations and assumptions. Subsequently, we used linear regression to deepen our insight into the dataset. We were eventually able to create a predictive linear regression model with a reasonable R-squared, RMSE and high reproducibility. At the end of the project, we obtained explanations answering the business problems we proposed at the beginning of this notebook.

1.The distribution and trend of housing price in Taipei.
From exploratory data analysis, we observed that more and more transactions were taking place in suburban areas instead of downtown districts. Nevertheless, the average unit price downtown remains much higher than in suburban areas. This can be explained by the saturation of the downtown housing market, which keeps prices firm and shifts part of the demand to suburban areas with more opportunities. However, according to external investigation, housing prices in suburban Taipei are continuously rising, and we are afraid the average price may soon approach the downtown level.

Interestingly, we also found that most of the transactions involve apartments, with very few single-family houses, reflecting Taipei's extremely high population density.

2.What are the most positively influential features affecting housing price?
According to the result summary of our linear regression model, the most positively influential feature is location, and the result reflects real-world statistics from Taiwanese media quite well. Aside from location, the main building ratio and the building material also matter. However, the conclusion for building material is less firm because of the data distribution (most of the data belongs to one category).

3.What are the most negatively influential features affecting housing price?
According to the result summary of our linear regression model, building age is the most negatively influential feature; across its full range it can shift the predicted price by around 113,000 NTD per square meter. Aside from building age, location and building state are also influential. Our model shows that building states with fewer storeys and no elevator have a larger negative impact on housing price, which makes sense, as most buyers favor taller (sometimes) and better-equipped (with elevator) buildings when considering buying their house.